Regular Expression Notes

Must Watch!



regex regexr


Capturing group

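(abc){3} matches abcabcabc.
(這個){2} matches 這個這個.

In R, \\1 in the replacement re-inserts the text captured by the first group. This wraps every run of five digits in a clickable span:
keywordList = gsub('(\\d{5})', paste0("<span onclick=\'xunbao\\(\"", "\\1\"", "\\)\'>", "\\1", "</span>"), keywordList)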
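The same two ideas can be sketched in Python, assuming its standard re module. The helper name wrap_codes is hypothetical; xunbao is the handler name taken from the R line above.

```python
import re

# (abc){3} repeats the whole group three times; the group keeps its last repetition.
repeated = re.fullmatch(r"(abc){3}", "abcabcabc")

def wrap_codes(text: str) -> str:
    """Wrap each run of five digits in a span, reusing the capture as \\1."""
    return re.sub(r"(\d{5})", r"""<span onclick='xunbao("\1")'>\1</span>""", text)

wrapped = wrap_codes("code 12345 ok")
```

Here \1 appears twice in the replacement, so the captured digits are inserted both as the handler argument and as the visible text.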

Capturing Groups, Non-Captured Group

Capturing groups treat multiple characters as a single unit. They are created by placing the characters to be grouped inside parentheses, and the text they match is stored so that it can be reused.

A non-capturing group groups a set of characters without capturing the matched text: it tells the engine not to store the match in a numbered slot.
Syntax: (?:expression)
where expression is the pattern to be matched.

Capturing group example (R):
codechiname = "asdfghjkl"
gsub('(^....).*', '\\1', codechiname)
"asdf"
(^....) is the capturing group and is remembered; \\1 recalls the remembered group.

Non-capturing group example:
Regex: (?:animal)(?:=)(\w+)(,)\1\2
Search strings:
Line 1 - animal=cat,dog,cat,tiger,dog
Line 2 - animal=cat,cat,dog,dog,tiger
Line 3 - animal=dog,dog,cat,cat,tiger
(?:animal) --> non-captured group
(?:=) --> non-captured group
(\w+) --> captured group 1
(,) --> captured group 2
\1 repeats whatever captured group 1 matched and \2 repeats captured group 2, the comma, so the pattern needs the same word twice in a row. Line 2 matches animal=cat,cat, (group 1 is cat) and Line 3 matches animal=dog,dog, (group 1 is dog). Line 1 does not match, because cat, is followed by dog rather than by a second cat.
By position, (?:animal) would be group 1 and (?:=) would be group 2, but ?: makes them non-capturing. Non-capturing groups are not counted, so numbering starts from the first capturing group, and the text matched by (?:animal) cannot be recalled later.

Text captured by a group can be used later in the regex (a backreference) or in the replacement part of a search-and-replace. Making a group non-capturing exempts it from both uses. Non-capturing groups are useful when a pattern needs many groups but only some of them should be captured; that is pretty much the reason they exist. While learning about groups, also look at atomic groups. There are also lookaround groups, but they are a little more complex and less commonly used.

Example of a backreference later in the regex:
<([A-Z][A-Z0-9]*)\b[^>]*>.*?</\1>
This finds an XML tag (without namespace support). ([A-Z][A-Z0-9]*) is a capturing group, in this case the tag name. The \1 later in the regex matches only the same text that the first group matched, in this case the end tag.

In JavaScript, suppose you want to match "cat is animal", keeping cat and animal but requiring " is " between them.
// (?: is ) must be present but is not captured
"cat is animal".match(/(cat)(?: is )(animal)/) ; result ["cat is animal", "cat", "animal"]
// a lookahead matches only "cat"; the text inside the lookahead is not part of the match
"cat is animal".match(/cat(?= is animal)/) ; result ["cat"]
// a capturing group inside the lookahead returns animal as well
"cat is animal".match(/(cat)(?= is (animal))/) ; result ["cat", "cat", "animal"]
// the group around cat only duplicates the match, so remove it
"cat is animal".match(/cat(?= is (animal))/) ; result ["cat", "animal"]
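A minimal Python sketch of the animal example, assuming the standard re module; the variable names are illustrative only. It shows that only the two capturing groups are numbered and that \1 requires the first word to repeat.

```python
import re

pattern = re.compile(r"(?:animal)(?:=)(\w+)(,)\1\2")
lines = [
    "animal=cat,dog,cat,tiger,dog",
    "animal=cat,cat,dog,dog,tiger",
    "animal=dog,dog,cat,cat,tiger",
]

# groups() lists capturing groups only; the (?:...) groups are not counted.
results = []
for line in lines:
    m = pattern.search(line)
    results.append(m.groups() if m else None)

# The " is " group is required for the match but is not returned.
cat_groups = re.search(r"(cat)(?: is )(animal)", "cat is animal").groups()
```

The first line yields None because cat, is followed by dog, so \1 cannot repeat the capture.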

not containing </a>

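Lines not containing 這個 (the lookahead is checked before every character, so the whole line must be free of 這個):
^((?!這個).)*$
Sample text: test one two a這個s three

Lines not containing a tab:
^[^\t]*$
^((?!\t).)*$
A span whose remaining text on the line contains no "/":
<span class="brown">((?!/).)*$
Lines not containing </a>:
^((?!</a>).)*$
Tags other than img: <[^img].+?> does not do this, because [^img] is a character class that excludes the single characters i, m and g. Use a negative lookahead instead:
<(?!img\b)[^>]*>
To find all instances of "foo" that are neither preceded by "." nor followed by "/":
(?<!\.)foo(?!/)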
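A short Python sketch of the "does not contain" idiom, assuming the sample text is split into one item per line (that split is an assumption). It also uses a negative lookahead for the img case, since a class such as [^img] only excludes single characters.

```python
import re

# The negative lookahead is tested before each character of the line.
NO_THIS = re.compile(r"^((?!這個).)*$")
sample_lines = ["test", "one two", "a這個s", "three"]
kept = [line for line in sample_lines if NO_THIS.match(line)]

# Tags whose name is not img: the lookahead rejects "<img" followed by a word boundary.
NON_IMG_TAG = re.compile(r"<(?!img\b)[^>]*>")
tags = NON_IMG_TAG.findall("<i>x</i><img src='a.png'><p>")
```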

Lookahead and Lookbehind

(?=subexp) look-ahead
(?<=subexp) look-behind
(?!subexp) negative look-ahead
(?<!subexp) negative look-behind

Lookahead: the following string is foo: (?=foo)
Lookbehind: the preceding string is foo: (?<=foo)
Negative lookahead: the following string is not foo: (?!foo)
Negative lookbehind: the preceding string is not foo: (?<!foo)

Sample text: test one two a這個s three look這個look這個look這個

Look-ahead finds something followed by 這個; the match ends just before 這個:
.(?=這個)
Look-behind finds something preceded by 這個; the match starts just after 這個:
(?<=這個).
Negative look-ahead finds a character NOT followed by 這個:
.(?!這個)
Negative look-behind finds a character NOT preceded by 這個:
(?<!這個).
Both combined, "foo" not preceded by "." and not followed by "/":
(?<!\.)foo(?!/)

Negated character classes
The ^ inside square brackets negates the class. [^.]foo also finds a "foo" not preceded by ".", but unlike the lookbehind it consumes the preceding character, so it cannot match a "foo" at the very start of the text. <[^img] matches "<" followed by one character that is not i, m or g; it does not mean "not img". [^0-9\r\n] matches any character that is not a digit or a line break. q[^u] means "a q followed by a character that is not a u".
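The lookaround forms above can be tried in Python; this is a sketch assuming the standard re module, which requires lookbehind patterns to have a fixed width (true for every pattern used here). The test strings are illustrative.

```python
import re

text = "a這個s"

before = re.findall(r".(?=這個)", text)    # character followed by 這個
after = re.findall(r"(?<=這個).", text)    # character preceded by 這個

# foo not preceded by "." and not followed by "/": only the last foo qualifies.
starts = [m.start() for m in re.finditer(r"(?<!\.)foo(?!/)", ".foo foo/ foo")]
```

Lookarounds consume no characters, so the reported matches contain only the text outside the assertion.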

Grouping

(x) Matches x and remembers the match. These are called capturing groups. For example, /(foo)/ matches and remembers "foo" in "foo bar". Capturing groups are numbered according to the order of their left parentheses, starting from 1. The matched substring can be recalled from the resulting array's elements [1], ..., [n] or from the predefined RegExp object's properties $1, ..., $9.

(?:x) Matches x but does not remember the match. These are called non-capturing groups. The matched substring cannot be recalled from the resulting array's elements [1], ..., [n] or from the predefined RegExp object's properties $1, ..., $9.

Regular Expression Recipes

strip all HTML tags: <(.|\n)+?>
strip digits: \d{1,3}.?
strip digits with decimals: (\d*\.)?\d+

Feature | Syntax | Description | Example
Capturing group | (regex) | Parentheses group the regex between them. They capture the text matched by the regex inside them into a numbered group that can be reused with a numbered backreference. They allow you to apply regex operators to the entire grouped regex. | (abc){3} matches abcabcabc. First group matches abc.
Capturing group | \(regex\) | Escaped parentheses group the regex between them. They capture the text matched by the regex inside them into a numbered group that can be reused with a numbered backreference. They allow you to apply regex operators to the entire grouped regex. | \(abc\){3} matches abcabcabc. First group matches abc.
Non-capturing group | (?:regex) | Non-capturing parentheses group the regex so you can apply regex operators, but do not capture anything. | (?:abc){3} matches abcabcabc. No groups.
Backreference | \1 through \9 | Substituted with the text matched between the 1st through 9th numbered capturing group. | (abc|def)=\1 matches abc=abc or def=def, but not abc=def or def=abc.
Backreference | \10 through \99 | Substituted with the text matched between the 10th through 99th numbered capturing group. |
Backreference | \k<1> through \k<99> | Substituted with the text matched between the 1st through 99th numbered capturing group. | (abc|def)=\k<1> matches abc=abc or def=def, but not abc=def or def=abc.
Backreference | \k'1' through \k'99' | Substituted with the text matched between the 1st through 99th numbered capturing group. | (abc|def)=\k'1' matches abc=abc or def=def, but not abc=def or def=abc.
Backreference | \g1 through \g99 | Substituted with the text matched between the 1st through 99th numbered capturing group. | (abc|def)=\g1 matches abc=abc or def=def, but not abc=def or def=abc.
Backreference | \g{1} through \g{99} | Substituted with the text matched between the 1st through 99th numbered capturing group. | (abc|def)=\g{1} matches abc=abc or def=def, but not abc=def or def=abc.
Backreference | \g<1> through \g<99> | Substituted with the text matched between the 1st through 99th numbered capturing group. | (abc|def)=\g<1> matches abc=abc or def=def, but not abc=def or def=abc.
Backreference | \g'1' through \g'99' | Substituted with the text matched between the 1st through 99th numbered capturing group. | (abc|def)=\g'1' matches abc=abc or def=def, but not abc=def or def=abc.
Backreference | (?P=1) through (?P=99) | Substituted with the text matched between the 1st through 99th numbered capturing group. | (abc|def)=(?P=1) matches abc=abc or def=def, but not abc=def or def=abc.
Relative Backreference | \k<-1>, \k<-2>, etc. | Substituted with the text matched by the capturing group that can be found by counting as many opening parentheses of named or numbered capturing groups as specified by the number from right to left starting at the backreference. | (a)(b)(c)(d)\k<-3> matches abcdb.
Relative Backreference | \k'-1', \k'-2', etc. | Same as \k<-1>. | (a)(b)(c)(d)\k'-3' matches abcdb.
Relative Backreference | \g-1, \g-2, etc. | Same as \k<-1>. | (a)(b)(c)(d)\g-3 matches abcdb.
Relative Backreference | \g{-1}, \g{-2}, etc. | Same as \k<-1>. | (a)(b)(c)(d)\g{-3} matches abcdb.
Relative Backreference | \g<-1>, \g<-2>, etc. | Same as \k<-1>. | (a)(b)(c)(d)\g<-3> matches abcdb.
Relative Backreference | \g'-1', \g'-2', etc. | Same as \k<-1>. | (a)(b)(c)(d)\g'-3' matches abcdb.
Failed backreference | Any numbered backreference | Backreferences to groups that did not participate in the match attempt fail to match. | (a)?\1 matches aa but fails to match b.
Invalid backreference | Any numbered backreference | Backreferences to groups that do not exist at all are valid but fail to match anything. | (a)?\2|b matches b in aab.
Nested backreference | Any numbered backreference | Backreferences can be used inside the group they reference. | (a\1?){3} matches aaaaaa.
Forward reference | Any numbered backreference | Backreferences can be used before the group they reference. | (\2?(a)){3} matches aaaaaa.
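
Only some of the syntaxes in the table exist in any one flavor. As a sketch for one flavor, Python's standard re module accepts \1 and the named form (?P=name); the group name word below is made up for illustration. The last two lines show the "failed backreference" row as Python behaves.

```python
import re

numbered = re.compile(r"(abc|def)=\1")
named = re.compile(r"(?P<word>abc|def)=(?P=word)")

candidates = ("abc=abc", "def=def", "abc=def")
same = [bool(numbered.fullmatch(s)) for s in candidates]
same_named = [bool(named.fullmatch(s)) for s in candidates]

# Group 1 is optional; when it did not participate, \1 cannot match in Python.
optional_hit = re.fullmatch(r"(a)?\1", "aa")
optional_miss = re.search(r"(a)?\1", "b")
```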

RegReplace User Guide

sublime RegReplace User Guide

create a regex macro example

create a regex macro example in sublime

extract inner text from anchor

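To strip the anchor tags and keep only their inner text, replace: (<[aA]\b[^>]*>|</[aA]>) with: \n or ""
Use [aA], not [a|A], for an upper- or lower-case a: inside square brackets | is a literal character. The \b keeps tags such as <abbr> from matching.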
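
A small Python sketch of the same replacement, assuming the standard re module; anchor_text is a hypothetical helper name. It is a text-level cleanup, not an HTML parser.

```python
import re

# Opening or closing anchor tags, case-insensitive; \b stops <abbr> from matching.
ANCHOR_TAG = re.compile(r"</?a\b[^>]*>", re.IGNORECASE)

def anchor_text(html: str) -> str:
    """Remove <a ...> and </a> tags, leaving their inner text in place."""
    return ANCHOR_TAG.sub("", html)

cleaned = anchor_text('<a href="x">Home</a> | <A class="n">Next</A> <abbr>ok</abbr>')
```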

locate string with multi elements exist (AND)

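百合、茯苓、玄参、乌药、泽泻、麦冬、当归、白术、茵陈、白芍、石斛、九节菖蒲、川芎、三七、地榆、延胡索、蒲黄、鸡内金
Split the list on 、 and put the items in an array, then apply AND logic in R.
To check whether a string contains multiple keywords, chain one lookahead per keyword:
^(?=.*百合)(?=.*鸡内金)
Every (?=.*keyword) must succeed from the same starting position, so all keywords are required (AND). Inside a group, | is the OR operator: (百合|鸡内金) matches either keyword.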
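
The AND-by-lookahead idea can be sketched in Python (the notes use R; this assumes Python's standard re module). contains_all is a hypothetical helper, and 甘草 is an extra keyword chosen only because it is absent from the list.

```python
import re

herbs = "百合、茯苓、玄参、乌药、泽泻、麦冬、当归、白术、茵陈、白芍、石斛、九节菖蒲、川芎、三七、地榆、延胡索、蒲黄、鸡内金"

def contains_all(text: str, keywords: list[str]) -> bool:
    """True when every keyword occurs somewhere in text (one lookahead each)."""
    pattern = "^" + "".join(f"(?=.*{re.escape(k)})" for k in keywords)
    return re.search(pattern, text, re.DOTALL) is not None

items = herbs.split("、")                      # the list as an array
both = contains_all(herbs, ["百合", "鸡内金"])   # both present
missing = contains_all(herbs, ["百合", "甘草"])  # second keyword absent
```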

to locate the tabs in a string

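\t.*\t matches from the first tab to the last tab on a line (.* is greedy).
^.*\t matches everything up to and including the last tab, so it locates the last tab.
(^.*?)(\t) locates the first tab: the lazy .*? stops at the earliest tab.
^[^\t]*\t[^\t]*\t locates the second tab, because [^\t]* cannot cross a tab. ^.*\t.*\t does not: its greedy first .* runs as far as the second-to-last tab, so the match ends at the last tab.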
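
A quick Python check of which tab each pattern stops at, assuming the standard re module; the sample row is made up. The second-tab pattern uses [^\t]* so that it cannot skip past a tab.

```python
import re

row = "a\tb\tc\td"   # tabs at indexes 1, 3 and 5

# end() is one past the matched tab, so subtract 1 to get the tab's index.
first_tab = re.search(r"^.*?\t", row).end() - 1
second_tab = re.search(r"^[^\t]*\t[^\t]*\t", row).end() - 1
last_tab = re.search(r"^.*\t", row).end() - 1
greedy_pair = re.search(r"^.*\t.*\t", row).end() - 1
```

greedy_pair lands on the last tab, which is why ^.*\t.*\t is not a reliable way to find the second one.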

Matching only the first occurrence

To match only the first comma-separated field, the pattern could be:
^[^,]+
That means:
^ start of the line
[^,] anything but a comma
+ repeated one or more times (use *, zero or more, if the first field can be empty)

Matching the second occurrence

Sample: a href="https://amzn.to/2QjGNZb" rel="nofollow" target="_blank"
Pattern: href="[^"]+"
That means:
href=" the literal text href="
[^"] anything but a "
+ repeated one or more times
" followed by the closing "

find lines that does not contain tab

Match a line with no tab anywhere in it:
^[^\t]+$
(^[^\t].*$ only checks that the first character is not a tab.)
Match a line with at least two tabs:
^(.*\t){2}
\t.*\t
Using a capturing group to append <br> after each img tag, that is, to replace img.*> by img.*><br>:
keywordList = gsub('(img.*?>)', '\\1<br>', keywordList)
Remember to use () to capture the group.
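
A Python sketch of these checks, assuming the standard re module; add_br and the sample strings are hypothetical. It uses ^[^\t]+$ for the no-tab case, since a pattern that only tests the first character would also accept lines with later tabs.

```python
import re

NO_TAB = re.compile(r"^[^\t]+$")
TWO_OR_MORE_TABS = re.compile(r"^(.*\t){2}")

rows = ["plain line", "one\ttab", "a\tb\tc"]
no_tab_rows = [r for r in rows if NO_TAB.match(r)]
multi_tab_rows = [r for r in rows if TWO_OR_MORE_TABS.match(r)]

def add_br(html: str) -> str:
    """Append <br> after each img tag by re-inserting the capture as \\1."""
    return re.sub(r"(img.*?>)", r"\1<br>", html)

with_breaks = add_br('<img src="a.png"><img src="b.png">')
```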

Numbered Backreferences

If your regular expression has named or numbered capturing groups, you can reinsert the text matched by any of those groups in the replacement text. As a simple example, the regex \*(\w+)\* matches a single word between asterisks, storing the word in the first (and only) capturing group. The replacement text <b>\1</b> replaces each regex match with the text stored by the capturing group between bold tags. Effectively, this search-and-replace replaces the asterisks with bold tags, leaving the word between the asterisks in place. Replacing *word* as a whole with <b>word</b> is far easier and far more efficient than trying to come up with a way to correctly replace the asterisks separately.

The \1 syntax for backreferences in the replacement text is borrowed from the syntax for backreferences in the regular expression. \1 through \9 are supported by the JGsoft applications, Delphi, Perl (though deprecated), Python, Ruby, PHP, R, Boost, and Tcl. Double-digit backreferences \10 through \99 are supported by the JGsoft applications, Delphi, Python, and Boost. If there are not enough capturing groups in the regex for the double-digit backreference to be valid, then all these flavors treat \10 through \99 as a single-digit backreference followed by a literal digit. The flavors that support single-digit backreferences but not double-digit backreferences also do this.

$1 through $99 for single-digit and double-digit backreferences are supported by the JGsoft applications, Delphi, .NET, Java, JavaScript, VBScript, PCRE2, PHP, Boost, std::regex, and XPath. These are also the variables that hold text matched by capturing groups in Perl. If there are not enough capturing groups in the regex for a double-digit backreference to be valid, then $10 through $99 are treated as a single-digit backreference followed by a literal digit by all these flavors except .NET, Perl, PCRE2, and std::regex.

Putting curly braces around the digit, as in ${1}, isolates the digit from any literal digits that follow. This works in the JGsoft applications, Delphi, .NET, Perl, PCRE2, PHP, Boost, and XRegExp.
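
As a sketch in one of the flavors listed, Python's re.sub accepts \1 in the replacement and \g<1> when a literal digit has to follow the group. The sample strings are illustrative.

```python
import re

# *word* -> <b>word</b>: the capture is reinserted with \1.
bold = re.sub(r"\*(\w+)\*", r"<b>\1</b>", "a *word* here")

# \g<1>0 means "group 1, then a literal 0"; \10 would be read as group 10.
times_ten = re.sub(r"(\d)", r"\g<1>0", "7 8")
```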

Named Backreferences

If your regular expression has named capturing groups, then you should use named backreferences to them in the replacement text. The regex (?'name'group) has one group called “name”. You can reference this group with ${name} in the JGsoft applications, Delphi, .NET, PCRE2, Java 7, and XRegExp. PCRE2 also supports $name without the curly braces. In Perl 5.10 and later you can interpolate the variable $+{name}. Boost too uses $+{name} in replacement strings. ${name} does not work in any version of Perl. $name is unique to PCRE2. In Python, if you have the regex (?P<name>group) then you can use its match in the replacement text with \g<name>. This syntax also works in the JGsoft applications and Delphi. Python and the JGsoft applications, but not Delphi, also support numbered backreferences using this syntax. In Python this is the only way to have a numbered backreference immediately followed by a literal digit. PHP and R support named capturing groups and named backreferences in regular expressions. But they do not support named backreferences in replacement texts. You’ll have to use numbered backreferences in the replacement text to reinsert text matched by named groups. To determine the numbers, count the opening parentheses of all capturing groups (named and unnamed) in the regex from left to right.
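
A minimal Python sketch of the \g<name> replacement syntax described above, assuming the standard re module; the group names year and month and the sample date are made up for illustration.

```python
import re

# Named groups in the pattern, named backreferences in the replacement.
swapped = re.sub(
    r"(?P<year>\d{4})-(?P<month>\d{2})",
    r"\g<month>/\g<year>",
    "2024-05",
)
```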

Regular Expressions Cheat Sheet

A regular expression specifies a set of strings that matches it. This cheat sheet is based on Python 3's Regular Expressions (http://docs.python.org/3/library/re.html) but is designed for searches within Sublime Text.

Special Characters
\ Escapes special characters or signals a special sequence.
. Matches any single character except a newline.
^ Matches the start of the string.
$ Matches the end of the string.
* Greedily matches 0 or more repetitions of the preceding RE.
*? Non-greedily matches 0 or more repetitions of the preceding RE.
+ Greedily matches 1 or more repetitions of the preceding RE.
+? Non-greedily matches 1 or more repetitions of the preceding RE.
? Greedily matches 0 or 1 repetitions of the preceding RE, i.e. makes it optional.
?? Non-greedily matches 0 or 1 repetitions of the preceding RE.
A|B Matches A; if A does not match, tries B, where A and B are arbitrary REs.
{m} Matches exactly m repetitions of the preceding RE.
{m,n} Greedily matches from m to n repetitions of the preceding RE.
{m,n}? Non-greedily matches from m to n repetitions of the preceding RE.
[...] Indicates a set of characters to match.
[amk] Matches 'a', 'm', or 'k'.
[a-z] Matches 'a' through 'z'.
[a-f0-7] Matches 'a' through 'f' or '0' through '7'.
[a\-z] Matches 'a', '-', or 'z'.
[a-] Matches 'a' or '-'.
[-a] Matches 'a' or '-'.
[(+*)] Matches '(', '+', '*', or ')'. Inside [] special characters are matched literally.
[\w] Matches the character class for '\w'. See character classes.
[^5] Matches anything other than '5'. '^' forms the complementary set only as the first character in a set.
[]()] Matches ']', '(', and ')'. ']' is taken literally only as the first character in a set.
[()\]] Matches ']', '(', and ')'.
(...) Matches the RE inside the parentheses and assigns a new group.
(?P<name>...) The RE matched is accessible by the group indicated by name.
(?...) Extension notation which changes a RE's behavior. These do not assign a new group.
(?aiLmsux) Sets the flag corresponding to each letter. Does not work within Sublime Text.
(?:...) A non-capturing version of parentheses. The matched substring cannot be retrieved later.
(?P=name) Matches the substring matched by the group named name.
(?#...) A comment, the contents are ignored.
(?=...) Lookahead assertion, the preceding RE only matches if this matches.
(?!...) Negative lookahead assertion, the preceding RE only matches if this doesn't match.
(?<=...) Positive lookbehind assertion, the following RE will only match if preceded by this fixed-length RE.
(?<!...) Negative lookbehind assertion, the following RE will only match if not preceded by this fixed-length RE.
(?(id)true|false) If group id exists then uses the true RE, else uses the false RE.

Character classes
\1 Matches the contents of the group labelled by the same number. Acceptable numbers are 1-99.
\A Matches at the start of the current string.
\b Matches the empty string at the beginning or end of a word, i.e. the boundary between \w and \W.
\B Matches the empty string not at the beginning or end of a word.
\d Matches any Unicode decimal digit, including 0-9.
\D Matches any character that is not a Unicode decimal digit.
\s Matches any Unicode whitespace character, including ' ', \t, \n, \r, \f and \v.
\S Matches any Unicode non-whitespace character.
\w Matches any Unicode word character, including a-z, A-Z, 0-9 and the underscore.
\W Matches any Unicode non-word character.
\Z Matches at the end of the string.
\a Matches the ASCII Bell (BEL).
\f Matches the ASCII Formfeed (FF).
\n Matches the ASCII Linefeed.
\r Matches the ASCII Carriage Return (CR).
\t Matches the ASCII Horizontal Tab.
\v Matches the ASCII Vertical Tab (VT).

Replacing with groups in Sublime Text: how to do a replace that reuses a group, for example converting this text:
Hello my name is bob
with this search term:
Find what: my name is (\w)+
Replace with: my name used to be $(1)
The search term works just fine, but $(1) does not insert the group.

Usually a back-reference is either $1 or \1 (backslash one) for the first capture group (the first match of a pattern in parentheses). So try:
my name used to be \1
or
my name used to be $1

The original capture pattern is also incorrect: (\w)+ repeats a one-character group, so it captures only the final letter of the name rather than the whole name. Use (\w+) to capture all of the letters.
Find: my name is (\w)+
Replace: my name used to be \1
Result: Hello my name used to be b
Find: my name is (\w+)
Replace: my name used to be \1
Result: Hello my name used to be bob
While (\w)+ will match "bob", it is not the grouping you want for replacement.

Use ( ) parentheses in your search string
Every matched segment of the search string that you want to use in the replacement string must be enclosed in ( ) parentheses; otherwise it is not reachable through variables such as $1, $2, ... or \1, \2, ...
EXAMPLE: replace 'em' with 'px' but preserve the number values:
margin: 10em
margin: 2em
The replacement string is margin: $1px.
CORRECT: enclose the segment wanted as $1 in ( ) parentheses.
FIND: margin: ([0-9]*)em (with parentheses)
REPLACE TO: margin: $1px
RESULT:
margin: 10px
margin: 2px
WRONG: the following pattern matches the desired lines, but no matched segment is available in the replacement as $1.
FIND: margin: [0-9]*em (without parentheses)
REPLACE TO: margin: $1px
RESULT ($1 is undefined):
margin: px
margin: px
Note that if you use more than 9 capture groups you have to use the syntax ${10}. $10 or \10 or \{10} will not work.
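
The difference between (\w)+ and (\w+) is not specific to Sublime Text. A sketch in Python, assuming the standard re module, which writes the backreference as \1 rather than $1:

```python
import re

text = "Hello my name is bob"

# (\w)+ repeats a one-character group, so the group ends up holding only "b".
last_letter = re.sub(r"my name is (\w)+", r"my name used to be \1", text)

# (\w+) puts the repetition inside the group, so the whole name is captured.
whole_name = re.sub(r"my name is (\w+)", r"my name used to be \1", text)

# em -> px while keeping the number, which must be inside parentheses.
css = re.sub(r"margin: ([0-9]*)em", r"margin: \1px", "margin: 10em\nmargin: 2em")
```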

to match a line that doesn't contain a word

Regular expression to match a line that doesn't contain a word: ^((?!word).)*$ (the same negative-lookahead pattern as in "not containing </a>" above).

non-capturing groups (?:)

Some numbers are written as 1st, 2nd, 3rd, 4th, ... To capture the numeric part but not the (optional) suffix, use a non-capturing group:
([0-9]+)(?:st|nd|rd|th)?
That matches numbers in the form 1, 2, 3... or in the form 1st, 2nd, 3rd,... but captures only the numeric part. A capturing group would also work, ([0-9]+)(st|nd|rd|th)?, since \1 still holds the number; the ?: only avoids storing a suffix that is never used.

?: is used when you want to group an expression but do not want to save it as a matched/captured portion of the string. An example would be something to match an IP address:
/(?:\d{1,3}\.){3}\d{1,3}/
The first 3 octets are not saved, but the (?:...) grouping shortens the regex without incurring the overhead of capturing and storing a match.

Consider the following text:
http://stackoverflow.com/
https://stackoverflow.com/questions/tagged/regex
Applying the regex below to it...
(https?|ftp)://([^/\r\n]+)(/[^\r\n]*)?
gives the following group results:
Match "http://stackoverflow.com/"
Group 1: "http"
Group 2: "stackoverflow.com"
Group 3: "/"
Match "https://stackoverflow.com/questions/tagged/regex"
Group 1: "https"
Group 2: "stackoverflow.com"
Group 3: "/questions/tagged/regex"
If the protocol is not needed, only the host and path of the URL, change the regex to use the non-capturing group (?:).
(?:https?|ftp)://([^/\r\n]+)(/[^\r\n]*)?
Now the result looks like this:
Match "http://stackoverflow.com/"
Group 1: "stackoverflow.com"
Group 2: "/"
Match "https://stackoverflow.com/questions/tagged/regex"
Group 1: "stackoverflow.com"
Group 2: "/questions/tagged/regex"
The protocol group is no longer captured. The engine still uses it to match the text, but leaves it out of the final result.
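
The URL and ordinal examples can be reproduced in Python as a sketch, assuming the standard re module; the variable names are illustrative.

```python
import re

url = "https://stackoverflow.com/questions/tagged/regex"

capturing = re.compile(r"(https?|ftp)://([^/\r\n]+)(/[^\r\n]*)?")
non_capturing = re.compile(r"(?:https?|ftp)://([^/\r\n]+)(/[^\r\n]*)?")

with_protocol = capturing.match(url).groups()
host_and_path = non_capturing.match(url).groups()

# Ordinals: the suffix is matched but never stored.
ordinal = re.fullmatch(r"([0-9]+)(?:st|nd|rd|th)?", "22nd").groups()
```

Both patterns match exactly the same text; only the numbering and contents of the groups differ.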

regexp Multibyte

regular-expressions unicode
regexp unicode

In R, Unicode script properties need perl=TRUE:
Multibyte <- "Sungpil_한성필_韓盛弼_Han"
\\p{Hangul} matches Korean Hangul characters and \\p{Han} matches Han (Chinese) characters. This removes the Han run:
gsub("\\p{Han}+", "", Multibyte, perl=TRUE)

Regular expression to match non-ASCII characters:
[^\x00-\x7F]+
It matches any character that is not in the ASCII character set (0-127, i.e. 0x0 to 0x7F). The same thing with Unicode escapes:
[^\u0000-\u007F]+
Unicode Code Charts: http://www.unicode.org/charts/

With Unicode property escapes (JavaScript, u flag), match a letter from any language:
/\p{Letter}/u
or, with the shorthand:
/\p{L}/u
To match letters together with other word characters such as hyphens:
/[\p{L}-]/u
Putting it together, words of all languages can be matched with:
/[\p{L}-]+/ug
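
A Python sketch of the non-ASCII pattern on the same sample string. Python's standard re module does not support \p{...} property escapes, so only the [^\x00-\x7F]+ form is shown.

```python
import re

NON_ASCII = re.compile(r"[^\x00-\x7F]+")
name = "Sungpil_한성필_韓盛弼_Han"

runs = NON_ASCII.findall(name)       # each run of non-ASCII characters
ascii_only = NON_ASCII.sub("", name) # the underscores around the runs remain
```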

find Last Occurrence of comma

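Remove everything up to and including the last comma, keeping only the text after it. The greedy .* runs as far as the last comma:
itemList = gsub("^.*[,]", "", itemList)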
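
The same greedy match in Python, as a sketch assuming the standard re module; the sample string is made up.

```python
import re

item = "alpha,beta,gamma"

# Greedy .* backtracks only as far as the last comma.
tail = re.sub(r"^.*,", "", item)

# Index of the last comma itself.
last_comma = re.search(r"^.*,", item).end() - 1
```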